
Add qwen3.5-fp4-b200-trt-mtp single-node TensorRT-LLM benchmark #1894

Merged: Oseltamivir merged 6 commits into main from qwen3.5-fp4-b200-trt-mtp on Jun 24, 2026

Conversation

@RohitNagraj (Collaborator) commented Jun 23, 2026

Adds the qwen3.5-fp4-b200-trt-mtp config — Qwen3.5-397B-A17B-NVFP4 on B200, single-node TensorRT-LLM with MTP speculative decode — for the 1k/1k and 8k/1k cells with a TP/TEP/DEP parallelism sweep.

  • nvidia-master.yaml: new config entry + MTP search space.
  • qwen3.5_fp4_b200_trt_mtp.sh: trtllm-serve benchmark script; generates the extra-llm-api config (MoE backend, attention-DP / batch-wait settings, MTP speculative config) per parallelism mode.
  • perf-changelog entry.

Note

Low Risk
Benchmark-only wiring (YAML config, launch script, changelog); no production inference, auth, or data-path changes.

Overview
Adds qwen3.5-fp4-b200-trt-mtp so Qwen3.5-397B-A17B-NVFP4 on B200 can be measured with single-node TensorRT-LLM and MTP speculative decode, alongside the existing non-MTP qwen3.5-fp4-b200-trt entry.

nvidia-master.yaml registers the config on tensorrt-llm/release:1.3.0rc18 with 1k/1k and 8k/1k fixed-seq-len cells and a TP / EP / attention-DP search space where every point sets spec-decoding: "mtp".

qwen3.5_fp4_b200_trt_mtp.sh drives trtllm-serve (pytorch backend): disables FlashInfer GDN prefill for MTP, writes qwen3.5-fp4-trt-mtp.yml with MTP (num_nextn_predict_layers: 3), CUTEDSL vs TRTLLM MoE and KV / batch-wait tuning keyed off DP attention and ISL/TP/EP, then runs the standard serving benchmark (optional lm-eval).

perf-changelog.yaml documents the new config key.
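
The per-mode config generation described above can be sketched as follows. This is a minimal illustration, not the script from the PR: `write_mtp_config` is a hypothetical helper, the key names follow the TensorRT-LLM extra-llm-api options as I understand them and should be checked against the pinned release, and the rule that attention-DP runs use CUTEDSL while the others use TRTLLM is an assumption, since the overview does not say which mode gets which backend.

```shell
#!/usr/bin/env bash
# Sketch only: key names and the backend-per-mode mapping are assumptions,
# not copied from qwen3.5_fp4_b200_trt_mtp.sh.

# write_mtp_config FILE DP_ATTENTION MAX_BATCH_SIZE  (hypothetical helper)
write_mtp_config() {
  local config_file=$1
  local dp_attention=$2
  local max_batch_size=$3
  local moe_backend=TRTLLM

  # Assumption: attention-DP runs select the CUTEDSL MoE backend.
  if [ "$dp_attention" = "true" ]; then
    moe_backend=CUTEDSL
  fi

  # Unquoted heredoc delimiter so the shell expands the variables.
  cat > "$config_file" <<EOF
cuda_graph_config:
  enable_padding: true
  max_batch_size: ${max_batch_size}
enable_attention_dp: ${dp_attention}
moe_config:
  backend: ${moe_backend}
kv_cache_config:
  dtype: fp8
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
EOF
}

# Example: write an attention-DP config into a scratch directory and show it.
out_dir=$(mktemp -d)
write_mtp_config "$out_dir/qwen3.5-fp4-trt-mtp.yml" true 8
cat "$out_dir/qwen3.5-fp4-trt-mtp.yml"
```

The generated file would then be handed to trtllm-serve as its extra LLM API options file. Keeping the MTP block constant and switching only the backend and attention-DP keys makes the difference between sweep points easy to diff.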

Reviewed by Cursor Bugbot for commit 7649ae1.

Add the qwen3.5-fp4-b200-trt-mtp config (Qwen3.5-397B-A17B-NVFP4, B200, 1k/1k
and 8k/1k) with MTP speculative decode across a TP/TEP/DEP parallelism sweep,
the qwen3.5_fp4_b200_trt_mtp.sh benchmark script, and a perf-changelog entry.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

- 16
- 32
- 64
- 128

CUDA graph sizes exceed max batch

Medium Severity

The extra LLM config hardcodes cuda_graph_config.batch_sizes through 128, while trtllm-serve gets --max_batch_size from CONC or CONC/8 (often 4–16 in this recipe). Peer Qwen and TRT-MTP scripts tie CUDA graph capture to MAX_BATCH_SIZE via max_batch_size, so graph warmup can overshoot the runtime batch cap and risk validation failures or excess memory use on low-concurrency jobs.
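
One way to keep graph capture inside the runtime cap is to derive the batch-size list from the same value passed to --max_batch_size. The sketch below is illustrative only: `capped_graph_batch_sizes` and the `DP_ATTENTION` variable are hypothetical names, the candidate sizes below 16 are assumed (the diff context shows only 16 through 128), and applying CONC/8 under attention-DP is my reading of the comment rather than something the PR states.

```shell
#!/usr/bin/env bash
# Sketch only: names and the CONC/8 rule are assumptions, not the PR's code.

# Print every candidate CUDA graph batch size that fits under the cap,
# one per line.
capped_graph_batch_sizes() {
  local max_batch_size=$1
  local size
  for size in 1 2 4 8 16 32 64 128; do
    if [ "$size" -le "$max_batch_size" ]; then
      echo "$size"
    fi
  done
}

CONC=${CONC:-64}
DP_ATTENTION=${DP_ATTENTION:-true}

# Assumption: attention-DP splits the concurrency across 8 ranks.
if [ "$DP_ATTENTION" = "true" ]; then
  MAX_BATCH_SIZE=$(( CONC / 8 ))
else
  MAX_BATCH_SIZE=$CONC
fi
# Never let the cap drop to zero for very small CONC values.
[ "$MAX_BATCH_SIZE" -ge 1 ] || MAX_BATCH_SIZE=1

# Join the surviving sizes with commas for a YAML flow sequence.
BATCH_SIZES=$(capped_graph_batch_sizes "$MAX_BATCH_SIZE" | paste -sd, -)
echo "max_batch_size: ${MAX_BATCH_SIZE}"
echo "batch_sizes: [${BATCH_SIZES}]"
```

The simpler alternative the comment points to is dropping the explicit list and setting max_batch_size in the CUDA graph config to the same MAX_BATCH_SIZE, as the peer scripts reportedly do.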

Comment on lines +149 to +159
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend openai \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$(( CONC * 10 ))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/

Collaborator

Missing --use-chat-template.

Collaborator Author

Thanks for catching it!

MTP runs need --use-chat-template on run_benchmark_serving for meaningful
acceptance, matching the other single-node MTP scripts.
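
A sketch of the benchmark call with the flag added is shown below. It reuses the arguments from the reviewed lines; the placeholder defaults, the `run_mtp_benchmark` wrapper, and the stand-in for `run_benchmark_serving` (used only when the repo's helper is not loaded) are hypothetical additions so the snippet runs on its own.

```shell
#!/usr/bin/env bash
# Sketch only. run_benchmark_serving is the repo's helper; if it is not
# defined (for example when running this file alone), a stand-in simply
# prints each argument on its own line.
if ! command -v run_benchmark_serving > /dev/null 2>&1; then
  run_benchmark_serving() { printf '%s\n' "$@"; }
fi

# Placeholder defaults (hypothetical); real jobs export these values.
MODEL=${MODEL:-placeholder-model}
PORT=${PORT:-8000}
ISL=${ISL:-1024}
OSL=${OSL:-1024}
RANDOM_RANGE_RATIO=${RANDOM_RANGE_RATIO:-1.0}
CONC=${CONC:-16}
RESULT_FILENAME=${RESULT_FILENAME:-result}

# Same arguments as the reviewed call, plus --use-chat-template.
run_mtp_benchmark() {
  run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend openai \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$(( CONC * 10 ))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template
}

run_mtp_benchmark
```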

@Ankur-singh (Collaborator) commented Jun 24, 2026

As a PR reviewer and CODEOWNER, I have reviewed this and have:

  • Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
  • Verified that this PR has passed PR validation.
  • Verified that this PR passes evals.
  • If a company claims that it supports vLLM/SGLang as first-class LLM inference engines on its hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
  • Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:
    • If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:
  • If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

Additional detail section:

This is a TRTLLM config, hence no recipe required

Signed: ankur-singh

@Klaud-Cold (Collaborator)

@Ankur-singh Blocks merge: Check 3 fails — the sign-off's Additional detail section has no recipe link (only "This is a TRTLLM config"); this workflow requires a link even for a TRT-LLM config. Open/link a recipe (vllm-project/recipes or sglang cookbook) or the published recipe page.

  • Check 0 — PASS: nvidia-master.yaml owned by @Ankur-singh @kedarpotdar-nv @jgangani (signer listed); the .sh + perf-changelog fall to catch-all * @InferenceX/core, covered.
  • Check 1 — PASS: in-PR head ded5975 has green, non-skipped single-node 1k1k/8k1k and eval runs: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/28051750810
  • Check 2 — PASS: gsm8k em_strict ~0.969 across configs, image tensorrt-llm/release:1.3.0rc18 matches the PR config.
  • Check 3 — FAIL: no recipe link present in the sign-off. Major server args (model/TP/EP/attention-DP/MTP/MoE backend/kv-cache fp8) cannot be checked without a linked recipe; a bare claim does not satisfy the standard.

@Oseltamivir (Collaborator) commented Jun 24, 2026

/reuse-sweep-run

@Oseltamivir Oseltamivir merged commit 07cdcfb into main Jun 24, 2026
26 checks passed
@Oseltamivir Oseltamivir deleted the qwen3.5-fp4-b200-trt-mtp branch June 24, 2026 05:04